# Image description generation

Kimi VL A3B Thinking 6bit

Kimi-VL-A3B-Thinking-6bit is a multilingual vision-language model converted to the MLX format, supporting image-text-to-text tasks.

Transformers Other

Pixtral is a multimodal model based on the Mistral architecture that can handle image and text inputs and generate text outputs.

Pixtral 12b Nf4

A 4-bit quantized version of the Mistral community's Pixtral-12B, focused on image-text-to-text tasks, with support for generating descriptions in Chinese.

Qwen2 Vl Tiny Random

A small, randomly initialized debugging model built from the configuration of Qwen2-VL-7B-Instruct, used for vision-language tasks.

Blip Dalle3 Img2prompt

A fine-tuned BLIP model that takes an image generated by DALL·E 3 and infers the prompt text that may have been used to generate it.

Transformers Supports Multiple Languages

Blip2 Opt 2.7b 8bit

BLIP-2 is a vision-language pre-trained model that combines an image encoder and a large language model for image-to-text generation tasks.

Transformers English

Mediocreatmybest
